First, the libraries for data import, analysis, plotting, distribution fitting, string matching, and pretty-printing of data frames and tables are loaded.
library(dplyr)
library(data.table)
library(plotly)
library(MASS)
library(stringr)
library(knitr)
Logistics plays an increasingly important role in product development in the automobile industry. Parts produced by a supplier must first be delivered to the OEM before they can be installed. What seems logical at first sight should be analyzed in more detail for a professional application. Therefore, create a distribution for the logistics delay of component “K7”. Use the production date (“Produktionsdatum”) from the data set “Komponente_K7.csv” and the receiving date of incoming goods (“Wareneingang”) from “Logistikverzug_K7.csv” (logistics delay). You can assume that produced goods are issued one day after the production date. For the model design in R, create a new data set “Logistics delay” that contains the required information from both data sets.
Importing the data files. Komponente_K7.csv contains the production dates of the components, while Logistikverzug_K7.csv contains the date of arrival for the next production stage. The first column of both files is dropped since it contains only line numbers. Information about defects is included as integers with the values 0 (false) or 1 (true). fread() interprets this as an integer, so it is converted to a logical value (TRUE or FALSE).
production <- fread(file.path("Data","Logistikverzug","Komponente_K7.csv"), header=TRUE, drop=1) %>%
  mutate(Fehlerhaft = as.logical(Fehlerhaft))
arrival <- fread(file.path("Data","Logistikverzug","Logistikverzug_K7.csv"), header=TRUE, drop=1) %>%
  mutate(Fehlerhaft = as.logical(Fehlerhaft))
The two tables with the data imported from the files are now joined into one along the vehicle ID (“IDNummer”).
logistics <- production %>%
  inner_join(arrival, by="IDNummer", suffix=c(".Komponente",".OEM"))
Since part manufacturer, OEM and logistics provider work seven days a week, weekends are of no importance to the logistic delay. It can therefore be calculated simply as the number of days between the production date and the arrival at the OEM. (Strictly speaking, the time in transit is one day less, since the component is dispatched the day after production; this shifts the whole distribution by one day but does not change its shape.)
logistics <- logistics %>%
  mutate(logistic_delay = as.numeric(Wareneingang) - as.numeric(Produktionsdatum))
mean_delay <- mean(logistics$logistic_delay)
The (arithmetic) mean logistic delay is approximately 7.08 days: on average, a component is on the road for about one week. As an alternative, the median is computed to make the analysis less susceptible to outliers with an unusually long logistic delay, such as those caused by the delivery problems during the COVID-19 pandemic. This is especially relevant because outliers to the right cannot be compensated by outliers to the left, since there can be no negative logistic delay (barring significant breakthroughs in time-travel research).
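The difference in robustness can be seen on a toy vector (made-up numbers, not the K7 data): a single extreme delivery drags the mean far to the right, while the median stays at the typical value.

```r
# Nine ordinary deliveries with a 7-day delay plus one extreme outlier,
# e.g. a shipment stuck for months during a supply-chain disruption.
delays <- c(rep(7, 9), 120)

mean(delays)    # 18.3 -- pulled far to the right by the single outlier
median(delays)  # 7    -- unaffected, still the typical delay
```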
median(logistics$logistic_delay)
## [1] 7
The logistic delay is assumed to be either normally or log-normally distributed. This assumption stems from a preliminary analysis, which concluded that the empirical density function most closely resembles these two distributions. Both distributions are fitted to the data with the fitdistr() function from the MASS library.
lognorm_fit <- fitdistr(logistics$logistic_delay, densfun="lognormal")
norm_fit <- fitdistr(logistics$logistic_delay, densfun="normal")
The data is plotted as a histogram and as a line plot of its density function, together with the density functions of the fitted normal and log-normal distributions, as generated by dnorm() and dlnorm(). While the histogram can be created fully automatically, the density function, which looks very similar, needs some preparation: the data is grouped by logistic delay and then counted, which yields a table of how often each delay occurred. This is possible because the delay only takes discrete, integer values (whole days).
The plot shows relative frequency, i.e. how often a certain delay occurs relative to the total number of deliveries. This means that the sum of the blue histogram bars and the area under the curve of the density functions are all 1.
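This normalization can be checked on a toy example (synthetic delays, not the K7 data): dividing the per-delay counts by the total number of observations always yields relative frequencies that sum to 1, which is what makes the histogram comparable to a probability density.

```r
# Synthetic example: count each distinct delay, then normalize by the
# total number of observations to obtain relative frequencies.
x <- c(5, 6, 6, 7, 7, 7, 8, 9)
rel_freq <- table(x) / length(x)

rel_freq       # per-delay shares, e.g. delay 7 occurs in 3 of 8 cases
sum(rel_freq)  # 1
```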
Since the data only has an accuracy of one day, the automatically selected bins with a size of 1 day are sufficient. This leads to an acceptable number of 12 bins.
delay_density <- logistics %>% group_by(logistic_delay) %>% count()
fig <- plot_ly(x = ~logistics$logistic_delay,
               type = "histogram",
               histnorm = "probability density",
               name = "Original data",
               width = 800, height = 400) %>%
  add_lines(x = ~delay_density$logistic_delay,
            y = ~delay_density$n / nrow(logistics),
            name = "Density function",
            line = list(width = 5)) %>%
  add_lines(x = seq(4, 12, 0.1),
            y = dlnorm(seq(4, 12, 0.1), meanlog = lognorm_fit$estimate["meanlog"], sdlog = lognorm_fit$estimate["sdlog"]),
            name = "Log normal distribution",
            line = list(dash = "dash", color = "black")) %>%
  add_lines(x = seq(4, 12, 0.1),
            y = dnorm(seq(4, 12, 0.1), mean = norm_fit$estimate["mean"], sd = norm_fit$estimate["sd"]),
            name = "Normal distribution",
            line = list(dash = "dot", color = "black")) %>%
  layout(xaxis = list(dtick = 1, title = "Logistic delay (in days)"),
         yaxis = list(range = c(0, 0.5), title = "Relative frequency"))
fig
Both the normal and log normal distributions describe the data reasonably well and their plots follow the original density function plot closely. The plot does not allow a decision as to which distribution is the better fit. The results become much clearer, however, when examining the parameters of the fits.
lognorm_fit
## meanlog sdlog
## 1.9473378003 0.1409926029
## (0.0002546761) (0.0001800832)
norm_fit
## mean sd
## 7.080436556 1.012299959
## (0.001828526) (0.001292963)
The second line of each output shows the standard errors of the fitted parameters. Here it becomes apparent that the errors for the log-normal distribution are nearly an order of magnitude smaller, signifying a more precise fit.
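A more formal way to compare the two candidates is an information criterion. The sketch below uses clearly log-normal synthetic data (not the K7 data); since fitdistr() results carry a log-likelihood, stats::AIC() can rank them directly.

```r
library(MASS)  # fitdistr()

# Synthetic illustration: draw clearly log-normal data, fit both
# candidate distributions, and compare them via AIC (lower = better).
set.seed(42)
x_sim <- rlnorm(10000, meanlog = 1.95, sdlog = 0.5)

lognorm_fit_sim <- fitdistr(x_sim, densfun = "lognormal")
norm_fit_sim    <- fitdistr(x_sim, densfun = "normal")

AIC(lognorm_fit_sim) < AIC(norm_fit_sim)  # TRUE: log-normal fits better
```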
The provided data includes two columns to indicate if a component is defective, one for the component manufacturer side and one for the OEM side. Since they are both originally called “Fehlerhaft”, they are named “Fehlerhaft.Komponente” and “Fehlerhaft.OEM” in the joined dataset.
If both of these columns agree, the decision is easy. However, there could be cases where the classification on the component manufacturer and the OEM side disagree. To find out if these cases exist, the logistics table is grouped by the two columns and then counted. This displays all present combinations of defect classifications and how often these combinations occur.
knitr::kable(logistics %>%
               group_by(Fehlerhaft.Komponente, Fehlerhaft.OEM) %>%
               count())
| Fehlerhaft.Komponente | Fehlerhaft.OEM | n |
|---|---|---|
| FALSE | FALSE | 306484 |
| TRUE | FALSE | 6 |
As we can see, most components were classified as “not defective” (FALSE) by both the component manufacturer and the OEM. The 6 components that were classified as defective on the component manufacturer side were not classified as defective by the OEM. There are several possible explanations for this: the defect may have been repaired before dispatch, the OEM's incoming inspection may have missed it, or one of the two classifications may simply be erroneous.
Since neither side can be completely trusted (the manufacturer might cheat, and the OEM might be less familiar with the details of what exactly can be defective in a component), a part should be considered defective if either side finds a defect. The resulting decision rule is:
A component is classified as defective if either “Fehlerhaft.Komponente” or “Fehlerhaft.OEM” is TRUE.
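Expressed in code, this rule is a simple logical OR of the two columns. The sketch below applies it to a toy table that reuses the column names of the joined logistics data (the values are made up):

```r
# Toy table with the same defect columns as the joined dataset.
parts <- data.frame(
  Fehlerhaft.Komponente = c(FALSE, TRUE,  FALSE, TRUE),
  Fehlerhaft.OEM        = c(FALSE, FALSE, TRUE,  TRUE)
)

# Defective as soon as either side has flagged the part.
parts$Fehlerhaft <- parts$Fehlerhaft.Komponente | parts$Fehlerhaft.OEM

parts$Fehlerhaft  # FALSE TRUE TRUE TRUE
```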
Why does it make sense to store the available data in separate files instead of saving everything in a huge table? Name at least four benefits. The available tables represent a typical database structure. What is it called?

Splitting the data into several linked tables has, among others, the following benefits: (1) redundant storage of repeated information is avoided, saving space; (2) updates only have to be made in one place, which prevents inconsistencies; (3) the individual tables stay small, keeping imports and queries fast; (4) access rights can be granted per table, so sensitive data can be protected; and (5) different parties can maintain their tables independently. A structure of several tables linked by ID columns like this is called a relational database.
How many of the parts T16 ended up in vehicles registered in Adelshofen?
Part “T16” is used in the components “K2LE2” and “K2ST2” (seats). This analysis therefore needs the tables linking vehicles to their installed components as well as the registration data. The vehicle–component tables are imported and combined first:
vehicle_component_files <- c(file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM1_Typ11.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM1_Typ12.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM2_Typ21.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM2_Typ22.csv"))
vehicle_component <- lapply(vehicle_component_files, fread, header=TRUE, drop=1) %>%
  rbindlist()
vehicle_component_filtered <- vehicle_component %>%
  filter(str_detect(ID_Sitze, "K2LE2") | str_detect(ID_Sitze, "K2ST2"))
The resulting data table, vehicle_component_filtered, has 818844 entries, so components containing part T16 were installed 818844 times.
reg_data <- fread(file.path("Data","Zulassungen","Zulassungen_alle_Fahrzeuge.csv"), header=TRUE, drop=1) %>%
  mutate(Gemeinden = str_to_upper(Gemeinden))
reg_data_filtered <- reg_data %>%
  filter(str_detect(Gemeinden, "ADELSHOFEN"))
These tables are now joined into one along the vehicle IDs. This final, complete table is called reg_veh_comp, because it links registration data with the corresponding vehicles and components.
reg_veh_comp <- reg_data_filtered %>%
  inner_join(vehicle_component_filtered, by=c(IDNummer="ID_Fahrzeug"))
The complete table has 96 entries, so 96 vehicles containing part T16 were registered in Adelshofen.
There is one caveat. Two municipalities in Germany are named “Adelshofen”. The larger one, Adelshofen in county Fürstenfeldbruck near Munich, is called “ADELSHOFEN” in the registration data, while the smaller one, Adelshofen in county Ansbach near Rothenburg ob der Tauber, is included as “ADELSHOFEN1”. If we group the table reg_veh_comp by the column “Gemeinden”, we can see how many of the vehicles were registered in each of the two towns.
knitr::kable(reg_veh_comp %>% group_by(Gemeinden) %>% count())
| Gemeinden | n |
|---|---|
| ADELSHOFEN | 48 |
| ADELSHOFEN1 | 48 |
We can see that in each of the two towns called “Adelshofen”, 48 vehicles with part 16 were registered.
Which data types do the attributes of the registration table “Zulassungen_aller_Fahrzeuge” have? Put your answers into a table which is integrated into your Markdown document and describe the characteristics of the data type(s).
The CSV file format does not technically prescribe data types, since everything is represented in text format and could theoretically be imported as characters, but the format of the data strongly implies certain data types. Import functions like fread() detect the implied formats and convert the data accordingly.
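This type inference can be illustrated with a small inline CSV (a synthetic example with made-up column names, not one of the project files). Base R's read.csv() behaves analogously to fread() in this respect: it inspects the text and picks a fitting type per column.

```r
# Synthetic inline CSV: one text-like column, one whole-number column.
csv_text <- "IDNummer,Anzahl\nK7-1-1,3\nK7-1-2,5"
d <- read.csv(text = csv_text)

# The importer infers a type for each column from the values it sees.
sapply(d, class)  # IDNummer: "character", Anzahl: "integer"
```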
| Column name | Data type | Explanation |
|---|---|---|
| [Column 1] | int (Integer) | An integer is a whole number without a fractional part. It can be positive, negative, or zero. |
| IDNummer, Gemeinden | chr (Character) | This data type is for data in text form. This can be any kind of text of arbitrary length. Since most data can be represented as text, this data type is very versatile. |
| Zulassung | IDate (integer-based date) | This contains a date, with a precision of one day. It is represented internally as an integer number of days since a certain defined “day 0”. |
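The integer-backed date representation can be verified directly. The sketch below uses base R's Date class, which works the same way as data.table's IDate (IDate merely stores the day count as an integer rather than a double): “day 0” is 1970-01-01, and unclass() reveals the underlying count of days since that epoch.

```r
# One day after the epoch ("day 0" = 1970-01-01).
d <- as.Date("1970-01-02")

unclass(d)  # 1 -- the internal representation is a day count
```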
You want to publish your application. Why does it make sense to store the records on the database of a server? Why can’t you store the records on your personal computer? What is an easy way to make your application available to your customers? Please name 4 aspects.
Storing the data on a local PC means the PC would have to function as a server, since otherwise it cannot make the data available to other users. While software for this purpose is readily available, it would require the PC to be permanently running and online. Depending on the number of simultaneous accesses, a PC might also be underpowered to serve many connections without disturbing the owner's normal work. There are also security risks involved, because a malicious user could gain access to other confidential and/or personal data stored on the PC.
For these reasons, the application should be made accessible via a dedicated server. The Shiny framework is already based on HTML, so the app can easily be made available to the public as a website. Hosting the application on a web server also prevents users from gaining direct access to the precise data on the logistics chain, which is likely a trade secret: the file Final_Data_Group_11.csv is only available to the server software and is never sent to the client.
On 11.08.2010 there was a hit and run accident. There is no trace of the license plate of the car involved in the accident. The police asks for your help, as you work for the Federal Motor Transport Authority, and asks where the vehicle with the body part number “K5-112-1122-79” was registered.
It is assumed the police request is for the purpose of criminal prosecution and therefore legal under § 98c StPO.
The relevant data files have already been imported for task 3. First, the column “ID_Karosserie” (which contains body part numbers) of table vehicle_component, linking vehicles to their components, is filtered for the given body part number, to find the corresponding vehicle ID.
accident_vehicle_components <- vehicle_component %>%
  filter(ID_Karosserie == "K5-112-1122-79")
The vehicle in the accident was the one with the ID 12-1-12-82. To find its place of registration, the registration data in reg_data is now filtered for this ID.
accident_vehicle_registration <- reg_data %>%
  filter(IDNummer == accident_vehicle_components$ID_Fahrzeug)
The vehicle involved in the accident was registered on 02 Jan 2009 in ASCHERSLEBEN.